1.  Description of Progress


1.1.  Overview 


This report describes the accomplishments of the ISIS project during the six month period
February - August 1985. We assume that the reader is familiar with the goals and strategy of the
project, summarized in [1]. After a brief summary, we discuss in greater detail the areas where
significant progress has been made.

As reported previously, we completed a prototype version of the ISIS system during the
first three months of the project. This software transforms fault-intolerant single-site program
specifications into fault-tolerant distributed implementations, and supervises execution of the
resulting code. During the second three-month period, several aspects of the system have been
enhanced: the interface between external programs and resilient objects, the language used to
specify resilient objects, and the command language used to control the system. We have also
designed and begun construction of a performance monitoring tool and some application software.


Concurrency is the key to good performance in a distributed system: the less synchronization
employed by a system, the less frequently it will experience delays while waiting for inter-site
message transmissions to complete. Recently, we achieved a basic insight into the nature of
concurrency in systems like ISIS. This has led us to redesign the ISIS communication primitives
[5], resulting in a communication subsystem that achieves very high levels of concurrency, but at
the same time makes it easier to design high-level software that is correct in the presence of
failures. The development of these primitives will probably prove to be the most important
achievement of the six-month report period. By achieving high levels of concurrency while
simultaneously simplifying concurrent algorithms, they represent a breakthrough in the methodology
for developing large, fault-tolerant systems.


1.2.  ISIS System enhancements


The ISIS prototype is now largely complete. The system can be broadly split into several
major parts:

1.  The interface presented to client software (normal C programs executing under UNIX).
Client programs treat resilient objects as the glue binding programs together into a
distributed, fault-tolerant application system.

2.  The  object  specification  language,  for  use  when  the  predefined  types  are  not  suitable  for 
some  application. 

3.  The  runtime  system,  which  provides  primitives  needed  by  resilient  objects  while  they  are 
executing. 

For  each  of  the  above  areas,  we  summarize  recent  work  and  status. 

The “client-object” interface permits normal UNIX programs, written in C, to invoke ISIS
resilient objects through a remote procedure-call interface. If desired, multiple calls to ISIS
can be bundled into a transaction terminating in a commit or abort (abort is the default if a
client fails or the site at which a client is running crashes). In the case of a commit,
changes made to data by the transaction become permanent; in the case of an abort, changes are
discarded and no other transaction observes data in an intermediate state.
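
The commit and abort behavior described above can be illustrated with a minimal, single-process
sketch in plain C. This is not the ISIS client interface: the names txn_begin, txn_write,
txn_commit, txn_abort and read_committed are hypothetical, and the "database" is only an
in-memory array. The sketch assumes that updates are made to a private working copy, which
replaces the committed state only on commit.

```c
#include <assert.h>
#include <string.h>

#define NACCOUNTS 4

/* Committed state, visible to every transaction. */
static long committed[NACCOUNTS];

/* A transaction works on a private copy of the data (hypothetical design). */
struct txn {
    long working[NACCOUNTS];
    int  active;
};

/* Start a transaction by snapshotting the committed state. */
void txn_begin(struct txn *t)
{
    memcpy(t->working, committed, sizeof committed);
    t->active = 1;
}

/* Update one account inside the transaction; nothing is visible yet. */
void txn_write(struct txn *t, int account, long balance)
{
    assert(t->active && account >= 0 && account < NACCOUNTS);
    t->working[account] = balance;
}

/* Commit: all changes made by the transaction become permanent together. */
void txn_commit(struct txn *t)
{
    assert(t->active);
    memcpy(committed, t->working, sizeof committed);
    t->active = 0;
}

/* Abort: the working copy is discarded; the committed state is untouched. */
void txn_abort(struct txn *t)
{
    t->active = 0;
}

/* What any other transaction would observe. */
long read_committed(int account)
{
    assert(account >= 0 && account < NACCOUNTS);
    return committed[account];
}
```

A caller would invoke txn_begin, any number of txn_write calls, and then either txn_commit or
txn_abort; until the commit, read_committed continues to return the old values.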

During the past 6 months, software to support the interface has been completed and
debugged. For example, assume that a banking program has been implemented as a front-end
program that interacts with users, and employs a database object in which accounts and balances
are maintained. Fig. 1 illustrates a fragment of code that the front end might use to contact the
database and update it, first inserting a new entry, and then initiating a “rebalancing” operation to
rebuild secondary indices for faster access. The rebalancing is done asynchronously. The
example first calls the ISIS name service to look up the database, called “dbase”, and then invokes
the object twice. Note that the RPC syntax is a relatively transparent one, unlike some previous RPC
proposals for C. Obviously, the procedure is free to use any other C statements or constructs that
are needed. The example does not use the BEGIN() and COMMIT() routines, hence each call
executes as a separate transaction. We have built test objects of this sort, and typical calls require
about a tenth of a second to return a result, a cost which is independent of the degree to which
data is replicated (of course, the update does not finish at remote sites for a longer period of
time, which does depend on the degree of replication). Such performance is more than adequate
for most applications.

A number of extensions have been made to the object specification language, which is an
extended version of C. These include a cobegin statement for concurrency, a toplevel statement
(as in MIT's ARGUS language), remote procedure calls to other objects (including asynchronous
ones), dynamic allocation and deallocation of resilient records, and dynamically specified
record sizes.

Debugging an ISIS object is done using a translator that converts specifications into
conventional C procedures that can be linked to calling programs and debugged using normal UNIX


import namespace, dbase;            /* Load interface definitions */

capability namespace NAMES;         /* Predefined capability on namespace */

capability dbase DB;                /* Capability on the dbase object */

struct db_entry dbregister(name, date, age, credit_lim)
char *name, *date, age;
float credit_lim;
{
    /* Get capability on dbase object */
    DB = NAMES$nm_find("dbase");

    /* First register the new entry */
    DB$db_register(name, date, age, credit_lim);

    /* Initiate asynchronous data structure rebalancing */
    ASYNC DB$db_balance();

    /* Return the result to the caller */
    return(db_ent);
}

Figure 1: A C procedure that looks up and then calls a

debugging facilities. This way, by the time an object is actually installed into ISIS it will be
relatively bug-free.

Figure 2 illustrates the nm_insert and nm_find procedures in the name-space, which manages
a directory of symbolic names and capabilities to which they correspond (nm_insert is invoked by
the system when a new instance of an object is created). The name space is represented as a list
of records, which are scanned sequentially in a do-while loop. The basic rule is that there is only
one entry per name. A commit flag is used to indicate whether the entry is valid or invalid, and
an invalid entry will be re-used on a subsequent nm_insert for the same name. Better concurrency
control is used in the actual namespace object, but in the interests of brevity we have simplified
the version given here. The basic strategy is to lock each record before accessing it, using
promotable read locks that are converted to write-locks if the entry is actually written. If the
transaction calling the nm_insert commits, the entry it has written becomes visible to other
callers; if it aborts, other transactions are permitted to read the entry, but detect that its
commit flag is clear and hence that it is invalid.
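
The one-entry-per-name rule and the re-use of invalid entries can be sketched in ordinary C,
without the locking, replication or transactions that the resilient object provides. The
sketch below is an assumption-laden illustration and not the ISIS namespace code: the table
size NM_MAX, the use of an int as a stand-in for a capability, and the helper nm_undo (which
imitates the effect of an aborted transaction by clearing the commit flag) are all hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NM_VALID   0x0001   /* entry in use */
#define NM_COMMIT  0x0002   /* entry has been committed */
#define NM_MAX     16       /* table size: an assumption of this sketch */

typedef int cap_t;          /* stand-in for an ISIS capability */

struct nm_entry {
    int   nm_flag;
    char  nm_name[32];
    cap_t nm_cap;
};

static struct nm_entry names[NM_MAX];

/* Bind name to cap. Returns the slot used, or -1 on conflict or overflow. */
int nm_insert(const char *name, cap_t cap)
{
    int n;

    assert(strlen(name) < sizeof names[0].nm_name);
    for (n = 0; n < NM_MAX; n++) {
        if (!(names[n].nm_flag & NM_VALID))
            break;                      /* end of list: take a fresh slot */
        if (strcmp(names[n].nm_name, name) == 0) {
            if (names[n].nm_flag & NM_COMMIT)
                return -1;              /* name is already bound */
            break;                      /* invalid entry for this name: re-use */
        }
    }
    if (n == NM_MAX)
        return -1;                      /* table is full */
    names[n].nm_flag = NM_VALID | NM_COMMIT;
    strcpy(names[n].nm_name, name);
    names[n].nm_cap = cap;
    return n;
}

/* Imitate an abort of the inserting transaction: the commit flag is cleared. */
void nm_undo(int slot)
{
    assert(slot >= 0 && slot < NM_MAX);
    names[slot].nm_flag &= ~NM_COMMIT;
}

/* Look up name. Returns 1 and stores the capability if a committed entry exists. */
int nm_find(const char *name, cap_t *cap)
{
    int n;

    for (n = 0; n < NM_MAX && (names[n].nm_flag & NM_VALID); n++) {
        if (strcmp(names[n].nm_name, name) == 0) {
            if (!(names[n].nm_flag & NM_COMMIT))
                return 0;               /* entry was aborted: treat as absent */
            *cap = names[n].nm_cap;
            return 1;
        }
    }
    return 0;
}
```

In this sketch an aborted entry keeps its slot, so a later nm_insert for the same name lands
in the same record rather than extending the list.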

Most C programmers should be able to write object specifications of this sort - the I/O
statement, locking, and the declaration rules are the only things that distinguish this code from a
normal C program. A more experienced user is, of course, able to build an object that would give
better or more concurrent performance. In contrast, only a few experts can write distributed
fault-tolerant programs of the sort that this object “compiles” into, and even fewer could
hand-code a really sophisticated namespace with any chance of obtaining a correct result.

Although we have been working actively within the system, most of this effort is relatively
technical and hence we defer a detailed discussion to a planned paper on the ISIS implementation.
One of the more visible aspects of progress, however, involves a new command language that
greatly increases our control over the ISIS system while it is running. Commands allow the user
to dynamically add and delete sites, “load” and “unload” types as they are needed, create and
delete objects, list the resilient objects known at a site, etc. Figure 3 illustrates the startup file


/* The name-space is a list of nm_entry structures */
typedef struct nm_entry nm_entry;
struct nm_entry
{
    int     nm_flag;
    char    nm_name[32];
    cap_t   nm_cap;
};

#define NM_VALID    0x0001      /* Entry in use */
#define NM_COMMIT   0x0002      /* Has been committed */

/* Definition of the namespace object */
resilient namespace
    created makeone;                        /* Initializes during create */
    entry nm_link, nm_unlink, nm_find;      /* Namespace routines */
{
    resilient nm_entry names[];             /* The name-space itself */

    proc int
    nm_link(name, cap)
    char *name;
    capability cap;
    {
        register i, n;
        nm_entry ent;

        /* Look for existing entry for this name */
        n = 0;
        do
        {
            /* This I/O statement locks and then reads the n'th namespace record */
            ent <-p names[n++];

            /* A match? */
            if(strcmp(ent.nm_name, name) == 0)
                if((ent.nm_flag&NM_COMMIT) == 0)
                    break;
                else
                    /* Name conflicts with a previous entry */
                    abort return(-1);
        }
        while(ent.nm_flag&NM_VALID);

        /* Found entry to use */
        ent.nm_flag = NM_VALID;

        /* Copy name and capability information into record */
        strcpy(ent.nm_name, name);
        ent.nm_cap = cap;

        /* Now write it “provisionally” with the commit bit set */
        ent.nm_flag |= NM_COMMIT;
        names[n] <- ent;


used to configure ISIS at site “anubis”; commands like these can also be issued interactively while
ISIS is running. The configuration file defines three ISIS sites by giving the machine names and
ARPANET port addresses at which they can be contacted (as offsets from a base-port number),
then defines 5 types and loads one of them. A file for restart information is then defined, and if
the system is coldstarting, the namespace object is instantiated. Otherwise, the system is told to
restart from the restart file written prior to the crash.

The overall robustness of ISIS is steadily increasing in response to a continuing program of
testing and development. Performance is good when no failures occur, particularly because of
concurrent update techniques which we describe below. Recovery from partial failures, in which
some sites remain operational, has been implemented, and also gives good performance. Still
needed is code to handle recovery from total failures (all sites fail at once), and partitioning (sites

/*
 *  ISIS configuration file for site anubis
 */

/* Site numbering and machine names */
mysite 1 anubis
site 2 osiris
site 3 amun

/* INET base port */
baseport 1200

/* Type definitions and corresponding executable files */
typedef 0 names_t /isis/bin/namespace
typedef 1 btree_t /isis/bin/btreeobj
typedef 2 file_t /isis/bin/fileobj
typedef 3 queue_t /isis/bin/queueobj
typedef 4 stack_t /isis/bin/stackobj

/* Load the namespace. Other types loaded as needed */
load names_t

/* Tell ISIS what restart file to use */
restartfile=/anubis/restart_file

if coldstart
    /* Coldstart: create namespace object */
    create "names": type=names_t, sites={anubis osiris amun}
else
    /* Otherwise, initiate restart sequence */
    restart
endif


Figure 3: Fragment of an ISIS command file

lose  the  ability  to  communicate  with  one  another  but  do  not  fail).  Although  both  of  these  are 
relatively  rare  events  in  most  networks,  we  intend  to  address  them  eventually. 

1.3.  Performance monitor

P. Stephenson has designed and is now implementing a distributed performance monitoring
program. This tool could be used in any distributed system, but is particularly well suited to
obtaining performance information from the ISIS system while it runs. The program is table
driven, and employs a graphics interface to the SUN window system for output. It is possible to
change the parts of the system being monitored, replay activity during a selected period of time,
focus on the detailed behavior of the system while some event is occurring, plot system load and
performance, and so forth. Our intention is to use the tool for overall tuning and debugging,
particularly as we begin developing data migration algorithms for ISIS.

1.4.  Application software development

With the completion of the ISIS system, we are now beginning to focus on applications.
There will be more to report in this area in the near future. At the moment, only some test
software and a distributed game program are operational. One of our long term ideas is to port a
medical database system, MDB-1, onto ISIS. This database system is available to us because of a
collaboration with medical researchers at Columbia University, and is of particular interest because
its modular structure is tailored to a resilient object environment.

1.5.  Communication primitives

K. Birman and T. Joseph have recently completed a thorough re-examination of the
communication problem in ISIS, focusing on the communication layer of the system and the ordering
relationships between communication events. The solution, described in [5], is remarkable in
several respects. First, it preserves a very high level of concurrency, which in a distributed system
is the key to obtaining good performance. Most other work on communication primitives has not
focused on this issue. Second, the ordering of events seen by users of the primitives is the same
at all sites (although not “simultaneous” in the sense of a global clock). This results in simplified
high-level algorithms, increasing the level of confidence that can be placed in their correctness. In
effect, we have designed a set of communication primitives that eliminates the need for most
synchronization by enabling a process to assume that other processes will experience the same
sequence of events as it does, unless they fail first.

An example will illustrate the class of problems that arise here. Consider a process p that is
updating a replicated data item maintained by a set of data managers. Assume that this update is
performed using a reliable broadcast: if any data manager receives the broadcast and remains
operational, all data managers will receive it. If p fails, a data manager could observe any of
several outcomes:

1.  The  update  is  received  prior  to  detection  of  the  failure. 

2.  The  failure  is  detected  prior  to  reception  of  the  update. 

3.  The  failure  is  detected  and  the  update  is  not  delivered  (anywhere). 

It  may  be  difficult  for  a  data  manager  to  distinguish  cases  2  and  3.  Moreover,  if  some 
managers  experience  the  first  outcome  and  others  the  second  one,  the  overall  system  must  still  be 
correct.  There  are  several  ways  that  these  problems  might  be  addressed.  By  performing  updates 
using  a  two-phase  commit,  agreement  can  be  reached  on  the  action  to  take  after  a  failure  is 
detected  [Skeen*a].  This  approach  could  be  slow  because  it  is  synchronous.  Another  possibility  is 
to  discard  messages  arriving  from  a  process  that  has  failed.  However,  inconsistencies  may  arise  if 
messages are discarded by one process but retained by another. A third alternative, representative
of the general approach of our work, is to construct a broadcast protocol in which the second
outcome never occurs. Using the ISIS communication primitives, a data manager can perform an
update  immediately  upon  receiving  the  corresponding  message,  and  can  take  a  recovery  action 
immediately  after  detecting  a  failure;  moreover,  every  data  manager  experiences  the  same 
sequence  of  events,  or  fails  first. 
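
The effect of ruling out the second outcome can be sketched as a small delivery layer that
holds back a failure notice until every update from the failed process has been delivered.
This is only an illustration under a strong assumption: the failure notice is taken to carry
the number of updates that p issued (last_seq below). How the ISIS primitives actually
establish this ordering is described in [5], not here, and all names in the sketch are
hypothetical.

```c
#include <assert.h>

#define MAXLOG 16

enum ev_kind { EV_UPDATE, EV_FAILURE };

/* State kept by the delivery layer at one data manager (hypothetical). */
struct manager {
    int delivered;              /* updates from p delivered so far */
    int fail_pending;           /* failure notice received but held back */
    int fail_after;             /* deliver the notice after this many updates */
    enum ev_kind log[MAXLOG];   /* events in the order the manager observed them */
    int nlog;
};

/* Hand one event to the data manager, recording the order of delivery. */
static void deliver(struct manager *m, enum ev_kind k)
{
    assert(m->nlog < MAXLOG);
    m->log[m->nlog++] = k;
}

/* Release a held-back failure notice once every update has been delivered. */
static void release(struct manager *m)
{
    if (m->fail_pending && m->delivered >= m->fail_after) {
        deliver(m, EV_FAILURE);
        m->fail_pending = 0;
    }
}

/* An update from p arrives from the network. */
void on_update(struct manager *m)
{
    m->delivered++;
    deliver(m, EV_UPDATE);
    release(m);
}

/* A failure notice for p arrives; last_seq is the number of updates p issued. */
void on_failure(struct manager *m, int last_seq)
{
    m->fail_pending = 1;
    m->fail_after = last_seq;
    release(m);
}
```

With this layer, a manager whose network happens to present the failure notice before the
last update still observes the update first, so all managers see the same sequence.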

What types of events are we including? We have distinguished three kinds of “broadcasts”,
which are issued by one process to a set of destination processes, all having the property that they
are delivered to every operational destination or none, regardless of failures.¹ The three types are:

¹Although the broadcasts are process-to-process, processes can reside on different sites. These are thus more
powerful than site-to-site broadcasts. Note that the term “broadcast” is used here to mean a software message-
transmission protocol, not an Ethernet broadcast (although such a hardware feature might be useful when implementing
some of our protocols).


1.  Independently issued broadcasts that should be delivered in a consistent order to any
overlapping destinations.

2.  Related  broadcasts  which  must  be  performed  in  the  order  they  were  issued. 

3.  Broadcasts  used  to  notify  processes  of  failures  and  recoveries  of  other  processes  in  their 
“process  group”. 
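
For the second kind of broadcast, one common way to deliver related messages in the order
they were issued is to tag each message with a per-sender sequence number and hold back any
message that arrives early. The sketch below shows that generic technique for a single
sender; it is not taken from the ISIS protocols, and the names receiver and receive, the
window size, and the assumption that no message is duplicated are all hypothetical.

```c
#include <assert.h>

#define WINDOW  8    /* how far ahead a message may arrive: an assumption */
#define MAXOUT 32

/* Per-sender state at one destination process. */
struct receiver {
    int next;               /* sequence number expected next */
    int held[WINDOW];       /* held[s % WINDOW] is set if message s is waiting */
    int payload[WINDOW];    /* contents of the waiting messages */
    int out[MAXOUT];        /* payloads in the order they were delivered */
    int nout;
};

/* Accept message (seq, value) from the sender; deliver in issue order. */
void receive(struct receiver *r, int seq, int value)
{
    assert(seq >= r->next && seq < r->next + WINDOW);
    r->held[seq % WINDOW] = 1;
    r->payload[seq % WINDOW] = value;

    /* Deliver every message that is now contiguous with those already delivered. */
    while (r->held[r->next % WINDOW]) {
        assert(r->nout < MAXOUT);
        r->held[r->next % WINDOW] = 0;
        r->out[r->nout++] = r->payload[r->next % WINDOW];
        r->next++;
    }
}
```

A message that arrives ahead of its predecessors simply waits in the window; once the gap is
filled, the loop delivers the whole contiguous run.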

Our protocols enable processes to employ consistent strategies when processing messages and
reacting to failures or recoveries, without using any special protocols to decide what to do.
Moreover, “race conditions” and other anomalies caused by unpredictable message orderings are
ruled out. A more thorough discussion, together with examples illustrating the extent to which
the approach simplifies high-level algorithms, appears in [5], which is being sent under separate
cover.

T.  Joseph  has  developed  a  model  within  which  this  problem  can  be  studied  formally  [6].  He 
has  found  that  the  technique  (and  hence  our  primitives)  would  be  useful  in  almost  any  correct 
message-based  system.  In  future  operating  systems,  we  believe  that  communication  primitives 
such  as  these  will  be  critical  to  good  performance,  and  the  key  to  the  development  of  correct, 
fault-tolerant  distributed  software. 

2.  Summary of trips and visits

K. Birman visited the University of Texas at Austin and the NASA Johnson Space Center,
where he gave seminars titled “An Overview of the ISIS Project”. He was accompanied by T.
Joseph, who will remain with the project as a Research Associate in the fall, and who spoke about
his work on concurrency in message-based systems. Birman also visited with Bill Joy at SUN
Microsystems in California regarding a paper on ISIS [4] and visited the systems research center at
IBM in Palo Alto (only minor local expenses were charged to the grant for this trip). Birman and
several students also attended the ACM PODC conference at the end of this reporting period.

3.  Status relative to planned effort

Research  is  underway  in  all  major  areas  of  the  project. 

4.  Fiscal status.

A  summary  of  expenditures  is  attached. 